forked from ray-project/ray
[pull] master from ray-project:master #56
Merged
…ng information about individual connector pieces to the `ctor_args_and_kwargs` file). (#48213)
…ingzoo 1.24.3 supersuit 3.9.3). (#45328)
…mum_gcs_destroyed_actor_cached_count (#48229) Signed-off-by: dayshah <[email protected]>
…e it is iterating with each iterable value. (#48270) Signed-off-by: win5923 <[email protected]>
/salute Signed-off-by: Edward Oakes <[email protected]>
## Why are these changes needed?

Refactor handle options.

- Change `_options` and `copy_and_update` to use `**kwargs` instead of listing out the entire set of handle options, so that in the future adding a new option only requires code changes in two locations: (1) the public-facing `options()` API, and (2) `_HandleOptions` itself.
- Separate handle options into "dynamic" and "init" options. Dynamic options can be modified in a lightweight manner with `.options()`, while init options can only be modified once, before a handle is initialized, with `.init()`.
- Initializing a handle is equivalent to instantiating the lazily-created router.
- `.options()` and `.remote()` implicitly initialize a handle.

Signed-off-by: Cindy Zhang <[email protected]>
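The `**kwargs`-based pattern described above can be sketched roughly as follows. This is an illustrative stand-in, not Serve's actual code; the class and field names here are hypothetical.

```python
from dataclasses import dataclass, fields, replace

@dataclass(frozen=True)
class HandleOptions:
    # Options are declared once, here; copy_and_update never enumerates them.
    method_name: str = "__call__"
    multiplexed_model_id: str = ""

    def copy_and_update(self, **kwargs) -> "HandleOptions":
        # Reject unknown options, but accept any declared field without
        # having to list each one in this method's signature.
        valid = {f.name for f in fields(self)}
        unknown = set(kwargs) - valid
        if unknown:
            raise ValueError(f"Unknown options: {unknown}")
        return replace(self, **kwargs)

opts = HandleOptions().copy_and_update(method_name="predict")
print(opts.method_name)  # -> predict
```

Adding a new option then only touches the dataclass declaration (and whatever public API forwards to it), matching the "two locations" goal stated above.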
## Why are these changes needed?

Modifies the `route` attribute used for logs and metrics to limit its cardinality (see linked issue). After this change, the behavior will be:

- If the user is not using FastAPI, the `route` will always be the `route_prefix`.
- If the user is using FastAPI, the `route` will be the `route_prefix` + the matched route string.

Take the following example:

```
@serve.deployment
@serve.ingress(fastapi_app)
class A:
    @fastapi_app.get("/{user_id}")
    def dynamic_subpath(self, user_id: str) -> str:
        return ...

serve.run(A.bind(), route_prefix="/prefix")
```

Previously, this could have infinite cardinality in the metrics tags (a unique tag value for each user). With this change, the tag would instead be `"/prefix/{user_id}"`. The "not found" response has also been modified to *not* include a `route` tag.

TODO:
- [x] Add unit tests for `get_asgi_route_name`
- [x] Add integration tests to `test_metrics.py`
- [x] Add tests for the replica request context

## Related issue number

Closes #47999

Signed-off-by: Edward Oakes <[email protected]>
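The cardinality-limiting idea above can be illustrated with a small regex-based matcher that maps a concrete request path back to its route template. This is only a sketch of the concept; Serve's real implementation inspects the ASGI app's route table rather than using regexes like this.

```python
import re

def match_route_template(path, templates):
    """Return the first FastAPI-style template matching `path`, or ""."""
    for template in templates:
        # Turn "/{user_id}" style segments into a one-segment wildcard.
        pattern = re.sub(r"\{[^/]+\}", r"[^/]+", template)
        if re.fullmatch(pattern, path):
            return template
    return ""  # no match: emit no `route` tag at all

# Every user id maps to the same tag value instead of one tag per user.
tag = match_route_template("/prefix/alice", ["/prefix/{user_id}"])
print(tag)  # -> /prefix/{user_id}
```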
…try` (#48234) Signed-off-by: Alexey Kudinkin <[email protected]>
…ING_OR_CONCURRENCY TaskStatus (#48242) Signed-off-by: Jiajun Yao <[email protected]>
…eployment level" (#48292)" (#48294) This reverts commit 703194a. The original commit caused `test_deploy_2.py` to become flaky. It appears this was a timing issue due to the 10s timeout set on the test. I've removed this, as it looks like a very old test case that is no longer relevant. --------- Signed-off-by: Edward Oakes <[email protected]>
… work on head node (#48103) This PR updates the compute configs for benchmark release tests to not schedule workers onto the head node. This reflects the best practice not to schedule heavy work on the head node for cluster stability. --------- Signed-off-by: Justin Yu <[email protected]>
## Why are these changes needed? depends on #48232 Pull out some router init parameters and router init logic into `create_router`. --------- Signed-off-by: Cindy Zhang <[email protected]>
## Why are these changes needed? Move `test_deployment_scheduler.py` from small to medium. Signed-off-by: Cindy Zhang <[email protected]>
If `args` is not a list, an exception will be thrown in the `if` statement above.
…ate_failed_env_runners=False` to `True`). (#48286)
## Why are these changes needed?

Implement full recursive cancellation at the Serve layer. Currently, at the time that a parent request is cancelled, if:

- the child request has not been assigned to a replica, _meaning_ the replica scheduler in the router is still processing that request,
- AND the parent request is not directly blocking and waiting on the result of the child request,

then the child request will not be correctly cancelled. Note that if the second bullet point were flipped, i.e. the parent request _was_ waiting on the result at the time of cancellation, then an `asyncio.CancelledError` would be sent directly to the replica scheduler and the child request would be successfully cancelled. However, this obviously cannot be guaranteed to be true.

This PR adds a cache for the asyncio tasks that are launched in the router to assign a request to a replica. When a parent request is cancelled, any "pending assignment requests" in the cache are cancelled.

Signed-off-by: Cindy Zhang <[email protected]>
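The pending-assignment cache described above can be sketched like this. All names are illustrative stand-ins for the router's internals, not Serve's actual classes.

```python
import asyncio

class Router:
    def __init__(self):
        # parent request id -> set of in-flight assignment tasks.
        self._pending_assignments = {}

    async def assign(self, parent_id, request):
        task = asyncio.ensure_future(self._choose_replica(request))
        self._pending_assignments.setdefault(parent_id, set()).add(task)
        try:
            return await task
        finally:
            self._pending_assignments[parent_id].discard(task)

    def cancel_pending(self, parent_id):
        # Called when the parent request is cancelled: cancel children
        # that have not yet been assigned to a replica.
        for task in self._pending_assignments.get(parent_id, ()):
            task.cancel()

    async def _choose_replica(self, request):
        await asyncio.sleep(10)  # stand-in for replica scheduling
        return "replica-0"

async def main():
    router = Router()
    child = asyncio.ensure_future(router.assign("parent-1", "req"))
    await asyncio.sleep(0)  # let the assignment task start
    router.cancel_pending("parent-1")
    try:
        await child
    except asyncio.CancelledError:
        return "cancelled"

print(asyncio.run(main()))  # -> cancelled
```

The key point is that the cancellation no longer depends on the parent happening to be awaiting the child's result; the cache makes unassigned children reachable at cancellation time.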
Stacked on: #48223 Reimplements the `.bind()` codepath to avoid using the Ray DAG codepath which adds a lot of complexity for little benefit. The DAG traversal code is much simpler, around 100 lines of self-contained code in `build_app.py`. I've also added unit tests for it. There should be no behavior change here. --------- Signed-off-by: Edward Oakes <[email protected]>
…joco 3.2.4, pettingzoo 1.24.3 supersuit 3.9.3)."" (#48308)
Previously the error message was injected directly as the HTTP status message Signed-off-by: Richo Healey <[email protected]>
## Why are these changes needed? fix router.py Signed-off-by: Cindy Zhang <[email protected]>
…g compilation (#47757)

Leaf nodes are nodes that are not output nodes and have no downstream nodes. If a leaf node raises an exception, it will not be propagated to the driver. Therefore, this PR raises an exception if a leaf node is found during compilation.

An alternative solution would be to implicitly add leaf nodes to the MultiOutputNode. Currently, the function `execute` can return multiple `CompiledDAGRef`s. The UX we want to provide is to implicitly add leaf nodes to the MultiOutputNode but not return the references of the leaf nodes. For example, with a MultiOutputNode containing 3 DAG nodes (2 normal DAG nodes + 1 leaf node):

x, y = compiled_dag.execute(input_vals)  # We don't return the ref for the leaf node.

However, the ref for the leaf node would be GC'ed inside `execute`, and `CompiledDAGRef.__del__` calls `get` if it was never called, which turns `execute` into a sync instead of an async operation, which is not acceptable.
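The leaf-node check can be sketched on a toy DAG representation (this is an illustrative model, not the Compiled Graphs implementation):

```python
def find_leaf_nodes(nodes, edges, output_nodes):
    """`edges` maps each node to its list of downstream nodes."""
    has_downstream = {src for src, dsts in edges.items() if dsts}
    return [
        n for n in nodes
        if n not in output_nodes and n not in has_downstream
    ]

def compile_dag(nodes, edges, output_nodes):
    leaves = find_leaf_nodes(nodes, edges, output_nodes)
    if leaves:
        # Fail fast: exceptions raised by these nodes would never
        # reach the driver.
        raise ValueError(
            f"Leaf nodes {leaves} found during compilation; add them "
            "to the MultiOutputNode."
        )
    return "compiled"

print(compile_dag(["a", "b"], {"a": ["b"]}, output_nodes=["b"]))  # -> compiled
```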
…#48358) Some tests are failing when we duplicate the original release tests. Marking them as unstable to be non-blocking. Will mark as stable once they are truly stable. Signed-off-by: Rui Qiao <[email protected]>
See #48228 (comment). Signed-off-by: Balaji Veeramani <[email protected]>
GcsServer manages several IO contexts and their threads, and hard-codes which class uses which IO context. This is inflexible: we can't easily declare another IO context and move some classes onto it without a pervasive code change. This PR introduces an `IoContextProvider` that manages IO contexts based on a static, compile-time Policy, whose author determines the class-to-IO-context mapping. This way we can change the mappings in one central place without a big chunk of changes everywhere. Signed-off-by: Ruiyang Wang <[email protected]>
To make the visualization more concise and easier to read, show nodes in concise form and show actor information in the subgraph instead.
…uf (#48593) Signed-off-by: dentiny <[email protected]>
Signed-off-by: hjiang <[email protected]>
…U loader thread, enable local learner w/ GPU. (#48314)
and remove the Linux wheel testing logic; Linux is not using this script anymore. Signed-off-by: Lonnie Liu <[email protected]>
… Consistent (#48088) Signed-off-by: Mengjin Yan <[email protected]>
The current graph shows the memory usage of each node alongside the MAX memory across the cluster. For OOM debugging, we care more about memory utilization (percentage) on each node.
…ing tasks (#48528)

- Rename "Active tasks by name" to "Requested Live tasks", to make it clear that it refers not only to running tasks but to any task that has been requested to be scheduled.
- Add a graph for "Running tasks by name" that only includes "Running_*" statuses.
- Rename "Active actors by name"
Fixes #48630 Signed-off-by: Balaji Veeramani <[email protected]>
In a recent investigation, we saw a `Broken pipe` issue when a node was being preempted and going through the shutdown process. The issue is due to a race condition:

* The node is being preempted and is being gracefully shut down.
* At the same time, remote hosts receive notifications that the node's state has changed and send `FreeObjects` requests to the node.
* When handling the `FreeObjects` request, the Raylet calls `Delete` on the local object store.
* However, by the time the `Delete` happens, the object store has already been stopped, which causes the broken-pipe issue.

The fix is to shut down the gRPC server before stopping the object store. This way, we can make sure that when the object store is stopped, there will be no ongoing or future gRPC requests. Signed-off-by: Mengjin Yan <[email protected]>
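The ordering fix can be sketched with toy stand-ins for the two components (these classes are illustrative, not Raylet code):

```python
class ObjectStore:
    def __init__(self):
        self.running = True

    def delete(self, obj_id):
        if not self.running:
            # Models the "Broken pipe" seen when Delete races shutdown.
            raise BrokenPipeError("object store already stopped")
        return f"deleted {obj_id}"

    def stop(self):
        self.running = False

class RpcServer:
    def __init__(self, store):
        self.store = store
        self.accepting = True

    def handle_free_objects(self, obj_id):
        if not self.accepting:
            return "rejected"  # request never reaches the store
        return self.store.delete(obj_id)

    def shutdown(self):
        self.accepting = False

store = ObjectStore()
server = RpcServer(store)
server.shutdown()  # 1. stop accepting RPCs first
store.stop()       # 2. only then stop the object store
print(server.handle_free_objects("obj-1"))  # -> rejected
```

With the old ordering (store stopped while the server still accepts requests), the `Delete` call would hit the stopped store and raise.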
Follow-up to a comment in #48609 that came in after the PR merged. cc: @pcmoritz Signed-off-by: angelinalg <[email protected]>
Two issues:
1. When the execution plan caches the dataset schema, it might call `unify_schema` to produce a single schema from all of the bundles' schemas. The issue is that this function expects all of the input schemas to be Arrow schemas, but we only check that at least one schema is an Arrow schema before calling it.
2. `to_pandas` iterates over blocks and adds them to `DelegatingBlockBuilder`. The issue is that `DelegatingBlockBuilder` expects all input blocks to be of the same type.

Signed-off-by: Balaji Veeramani <[email protected]>
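Issue (1) boils down to an `any()` vs. `all()` guard before unification. A minimal sketch with toy schema dicts (stand-ins for real Arrow/pandas schemas):

```python
def unify_schemas(schemas):
    # Pretend-unifier that requires homogeneous Arrow schemas.
    assert all(s["type"] == "arrow" for s in schemas)
    merged = {}
    for s in schemas:
        merged.update(s["fields"])
    return {"type": "arrow", "fields": merged}

def cache_dataset_schema(schemas):
    # Fixed guard: unify only when *all* schemas are Arrow, not when
    # merely one of them is.
    if all(s["type"] == "arrow" for s in schemas):
        return unify_schemas(schemas)
    return schemas[0]  # mixed types: fall back without unifying

all_arrow = [{"type": "arrow", "fields": {"a": "int64"}},
             {"type": "arrow", "fields": {"b": "float64"}}]
mixed = [{"type": "arrow", "fields": {"a": "int64"}},
         {"type": "pandas", "fields": {"a": "int64"}}]

print(cache_dataset_schema(all_arrow)["fields"])  # -> {'a': 'int64', 'b': 'float64'}
```

With an `any()` guard, the `mixed` input would reach `unify_schemas` and fail its homogeneity assumption.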
This PR adds docs to the `dev-workflow.md` complementing the changes in this PR: #48477 --------- Signed-off-by: Promptless Bot <[email protected]> Signed-off-by: Frances Liu <[email protected]> Signed-off-by: frances720 <[email protected]> Co-authored-by: promptless[bot] <179508745+promptless[bot]@users.noreply.github.com> Co-authored-by: angelinalg <[email protected]>
## Why are these changes needed?

Add a Project operator to `select_columns`.

Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
This PR integrates `uv` into the existing runtime env system, using `uv` to set up environments for better performance. TODO list: - [ ] Add `uv` to public documentation --------- Signed-off-by: dentiny <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
…with custom DNS for Kubernetes API. (#48541) Signed-off-by: ltbringer <[email protected]>
and fix windows build Signed-off-by: Lonnie Liu <[email protected]>
In #47958, visualization unit tests were introduced but would not run if the CI environment did not have graphviz installed (which was the case). This PR adds the dependency to test-requirements.txt and always runs the visualization tests. It also moves the visualization tests to a separate file, since test_accelerated_dag.py is too large.
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
## Why are these changes needed?

Refactor the resolve-request-arg function. Separate it into:
1. The actual type checking + argument resolution (injected into the Router as a function).
2. The logic for executing that resolution and replacing args/kwargs (kept as part of the Router).

Signed-off-by: Cindy Zhang <[email protected]>
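The split described above can be sketched as follows; the function and class names here are hypothetical stand-ins, not Serve's actual code.

```python
def resolve_deployment_responses(obj):
    # Stand-in for step 1 (injected): type checking + resolution.
    return "resolved" if obj == "response" else obj

class Router:
    def __init__(self, resolve_fn):
        # The resolution strategy is injected, so the Router doesn't
        # need to know how individual values are resolved.
        self._resolve_fn = resolve_fn

    def resolve_args(self, args, kwargs):
        # Step 2 (kept in the Router): apply the injected function
        # across args/kwargs and substitute the results.
        args = tuple(self._resolve_fn(a) for a in args)
        kwargs = {k: self._resolve_fn(v) for k, v in kwargs.items()}
        return args, kwargs

router = Router(resolve_deployment_responses)
print(router.resolve_args(("response", 1), {"x": "response"}))
```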
Signed-off-by: hjiang <[email protected]>
Now, when a Ray job is submitted, JobHead starts a Ray actor named `JobSupervisor`, which fork-and-execs a subprocess with the entrypoint. The subprocess inherits all environment variables from the JobSupervisor, plus some additional ones via `_get_driver_env_vars`. However, we also want to *remove* some env vars, specifically the Ray-internal ones including `RAY_JOB_ID` and `RAY_RAYLET_ID`, because they are meant for the JobSupervisor, not for the job we submit. For example, in the entrypoint `$RAY_JOB_ID` can be set to `01000000` because that's the job id of the JobSupervisor, while the entrypoint really belongs to job `08000000`.

This did not show up as a bug because these environment variables are not read by the driver's core workers. However, we found a user pattern of collecting all env vars in the driver process and passing them to tasks/actors via `env_vars`. If users do so, the started workers get confused by the stale job ids and crash.

This PR removes those env vars on the JobSupervisor fork-and-exec to hide them from the user entrypoint. Next steps: we should check for internal flags in `env_vars` and raise exceptions for them; and/or we can add an option to auto-remove them for the user, something like:

```py
@ray.remote(runtime_env={
    "env_vars": os.environ,
    # can be "raise", "allow_debugonly", "remove"; default to "raise"
    "env_vars_config": {"internal_flags": "remove"},
})
```

Signed-off-by: Ruiyang Wang <[email protected]>
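The scrubbing step can be sketched as a small filter applied before the fork-and-exec. The helper name and the exact variable list are illustrative; the PR names `RAY_JOB_ID` and `RAY_RAYLET_ID` specifically.

```python
INTERNAL_ENV_VARS = ("RAY_JOB_ID", "RAY_RAYLET_ID")

def build_entrypoint_env(supervisor_env, extra_env):
    # Drop the Ray-internal vars that describe the JobSupervisor itself,
    # then layer on the driver-specific additions.
    env = {k: v for k, v in supervisor_env.items()
           if k not in INTERNAL_ENV_VARS}
    env.update(extra_env)  # e.g. vars from _get_driver_env_vars
    return env

env = build_entrypoint_env(
    {"PATH": "/usr/bin", "RAY_JOB_ID": "01000000", "RAY_RAYLET_ID": "abc"},
    {"RAY_ADDRESS": "auto"},
)
print(sorted(env))  # -> ['PATH', 'RAY_ADDRESS']
```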
Signed-off-by: dayshah <[email protected]>
#47752 introduced a minor, silent issue in test_map. I stumbled across it because a linter I'm using complained.
## Why are these changes needed? Match for the more specific error that is raised, in this case `TypeError`, instead of matching for generic `RayTaskError`. --------- Signed-off-by: Cindy Zhang <[email protected]>
See Commits and Changes for more details.
Created by pull[bot]